<html>
<head>
<link rel="shortcut icon" href="./favicon.ico">
<link rel="stylesheet" type="text/css" href="./style.css">
<link rel="canonical" href="./Pipeline_Skid_Buffer.html">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Decouples two sides of a ready/valid handshake to allow back-to-back transfers without a combinational path between input and output, thus pipelining the path. *Can function as a two-entry Circular Buffer.*">
<title>Pipeline Skid Buffer</title>
</head>
<body>

<p class="inline bordered"><b><a href="./Pipeline_Skid_Buffer.v">Source</a></b></p>
<p class="inline bordered"><b><a href="./legal.html">License</a></b></p>
<p class="inline bordered"><b><a href="./index.html">Index</a></b></p>

<h1>Pipeline Skid Buffer</h1>
<p>Decouples two sides of a ready/valid handshake to allow back-to-back
 transfers without a combinational path between input and output, thus
 pipelining the path. <em>Can function as a two-entry Circular Buffer.</em></p>
<p>A skid buffer is the smallest <a href="./Pipeline_FIFO_Buffer.html">Pipeline FIFO
 Buffer</a>, with only two entries.  It is useful
 when you need to pipeline the path between a sender and a receiver for
 concurrency and/or timing, but not to smooth-out data rate mismatches.  It
 also only requires two data registers, which at this scale is smaller than
 LUTRAMs or Block RAMs (depending on implementation), and has more freedom
 of placement and routing.</p>
<p>Alternately, a Skid Buffer is also known as a Carloni Buffer. For reference,
 see Abbas and Betz <a href="./reading.html#elastic">"Latency Insensitive Design Styles for
 FPGAs"</a> (FPL, 2018).</p>
<h2>Background</h2>
<p>Networks-on-Chip (NoC) and elastic pipelines have as a fundamental building
 block a handshaking mechanism where each end of a link can signal if they
 have data to send ("valid"), or if they are able to receive data ("ready").
 When both ends agree (valid and ready both high), a data transfer occurs on
 that clock cycle.</p>
<p>However, pipelining handshaking is more complicated: simply adding
 a pipeline register to the valid, ready, and data lines will work, but now
 each transfer take two cycles to start, and two cycles to stop.  This isn't
 bad in terms of bandwidth if you know you can transfer a block of data per
 handshake, but now the receiver has to be aware of how many pipeline stages
 exist between it and the sender, and thus must have sufficient internal
 buffering to absorb the data that keeps arriving after it signals it is no
 longer ready to receive more data.</p>
<p>This is the basis of credit-based connections (which I'm not getting into
 here), which maximize bandwidth over long pipelines, but are overkill if you
 simply need to add a single pipeline stage between two ends, without having to
 modify them, so as to meet timing or allow each end to send off one item of
 data without having to wait for a response (thus overlapping communication and
 computation, which is desirable).</p>
<p>This fundamental pipeline block is the <em>skid buffer</em>.</p>
<h2>Figuring Out The Requirements</h2>
<p>To begin designing a skid buffer, let's imagine a single unit which can
 perform a valid/ready handshake and receive an input item of data, then
 performs the same handshake with the other end to output the data. </p>
<pre><code>Input                       Output
-----                       ------
          -------------
ready &lt;--|             |&lt;-- ready
valid --&gt;| Skid Buffer |--&gt; valid
data  --&gt;|             |--&gt; data
          -------------
</code></pre>
<p>Ideally, the input and output interfaces operate concurrently for maximum
 bandwidth: in the same clock cycle, a new data item is received on the
 input interface and put into a register, and that same register is
 simultaneously read out by the output interface. However, if the output
 interface is not transfering data on a given cycle, the input interface
 must not transfer data during that cycle also, else we will overwrite the
 data register before it was read out. To avoid this problem, the input
 interface should declare itself not ready in the same cycle as the output
 interface declaring itself not ready. But this forms a direct combinational
 connection between them, not a pipelined one.  <em>If we could connect both
 interfaces directly, and not affect timing or concurrency, we wouldn't need
 pipelining in the first place!</em></p>
<p>To resolve this contradiction, we need an extra buffer register to capture
 the incoming data during a clock cycle where the input interface is
 transferring data, but the output interface isn't, and there is already
 data in the main register. Then, in the next cycle, the input interface can
 signal it is no longer ready, and no data gets lost. We can imagine this
 extra buffer register as allowing the input interface to "skid" to a stop,
 rather than stopping immediately, which we'd previously found contradicts
 our pipelining requirements.</p>
<h2>Circular Buffer Mode</h2>
<p>Normally, a Skid Buffer reads in one value and read out one value each
 cycle. Should there be a stall at the output, the Skid Buffer fills up
 after one more input handshake and will not complete another input
 handshake until a value has been read out, causing a one-cycle stall at the
 input.  You can think of this as buffering the <em>earliest</em> values from the
 pipeline.</p>
<p>Setting <code>CIRCULAR_BUFFER</code> parameter to a non-zero value changes the
 behaviour at the input: the input handshake can always complete, discarding
 the earlier data already at the buffer output even if it is never read out
 and replacing it with the next previously buffered value.  You can think of
 this as buffering the <em>latest</em> values from the pipeline.  This is
 a two-entry circular buffer.</p>
<p>Contrary to normal operation, simultaneous input and ouput handshakes are
 possible on a full Skid Buffer in Circular Buffer Mode, giving full
 throughput with 2 cycles of latency. This is possible since <code>input_ready</code>
 no longer depends on the empty/full state of the buffer (which forces
 alternation of input and output handshakes), nor on the state of the output
 handshake (which is disallowed to prevent creating a combinational path
 between input and output).</p>

<pre>
`default_nettype none

module <a href="./Pipeline_Skid_Buffer.html">Pipeline_Skid_Buffer</a>
#(
    parameter WORD_WIDTH                = 0,
    parameter CIRCULAR_BUFFER           = 0     // non-zero to enable
)
(
    input   wire                        clock,
    input   wire                        clear,

    input   wire                        input_valid,
    output  wire                        input_ready,
    input   wire    [WORD_WIDTH-1:0]    input_data,

    output  wire                        output_valid,
    input   wire                        output_ready,
    output  wire    [WORD_WIDTH-1:0]    output_data
);

    localparam WORD_ZERO = {WORD_WIDTH{1'b0}};
</pre>

<h2>Data Path</h2>
<p>Feed the data_out register either from the input_data, or a buffered copy of
 input_data. Write to registers only if enabled by control.</p>
<p>Funneling into a single data_out register rather than selecting between two
 equal output registers avoids a mux after registers, fed by two data
 streams (thus more routing and delay). A single output register also
 retimes more easily into downstream logic.</p>
<p>Set up the default control values to match the "empty" state of the skid
 buffer, so the first input_data to arrive ends up in the data_out by
 default.  We don't have to worry about state here, just pass the data
 through unless told otherwise.</p>

<pre>
    reg                     data_buffer_wren = 1'b0; // EMPTY at start, so don't load.
    wire [WORD_WIDTH-1:0]   data_buffer_out;

    <a href="./Register.html">Register</a>
    #(
        .WORD_WIDTH     (WORD_WIDTH),
        .RESET_VALUE    (WORD_ZERO)
    )
    data_buffer_reg
    (
        .clock          (clock),
        .clock_enable   (data_buffer_wren),
        .clear          (clear),
        .data_in        (input_data),
        .data_out       (data_buffer_out)
    );

    reg                     data_out_wren       = 1'b1; // EMPTY at start, so accept data.
    reg                     use_buffered_data   = 1'b0;
    reg [WORD_WIDTH-1:0]    selected_data       = WORD_ZERO;

    always @(*) begin
        selected_data = (use_buffered_data == 1'b1) ? data_buffer_out : input_data;
    end

    <a href="./Register.html">Register</a>
    #(
        .WORD_WIDTH     (WORD_WIDTH),
        .RESET_VALUE    (WORD_ZERO)
    )
    data_out_reg
    (
        .clock          (clock),
        .clock_enable   (data_out_wren),
        .clear          (clear),
        .data_in        (selected_data),
        .data_out       (output_data)
    );
</pre>

<h2>Control Path</h2>
<p>We separate the control path so the associated data path does not have to
 know anything about the current state or its encoding.</p>
<p>This FSM assumes the usual meaning and behaviour of valid/ready handshake
 signals: when both are high, data transfers at the end of the clock cycle.
 It is an error to raise ready when not able to accept data (thus losing the
 incoming data), or to raise valid when not able to send data (thus
 duplicating previously sent data). <em>These error situations are not
 handled.</em></p>
<p>To operate our datapath as a skid buffer, we need to understand which
 states we want to allow it to be in, and which state transitions we also
 allow. This skid buffer has three states:</p>
<ol>
<li>It is Empty.</li>
<li>It is Busy, holding one item of data in the main register, either waiting
    or actively transferring data through that register.</li>
<li>It is Full, holding data in both registers, and stopped until the main
    register is emptied and simultaneously refilled from the buffer register, so no
    data is lost or reordered.  (Without an available empty register, the
    input interface cannot skid to a stop, so it must signal it is not ready.)</li>
<li>It is Full and in Circular Buffer Mode, holding data in both registers,
    and can accept new data into the buffer register while simultaneously
    replacing the contents of the main register with the current contents of
    the buffer register.</li>
</ol>
<p>The operations which transition between these states are:</p>
<ul>
<li>the input interface inserting a data item into the datapath (<code>+</code>)</li>
<li>the output interface removing a data item from the datapath (<code>-</code>)</li>
<li>both interfaces inserting and removing at the same time (<code>+-</code>)</li>
</ul>
<p>We also descriptively name each transition between states. These names will
 show up later in the code.</p>
<pre>
                 /--\ +- flow
                 |  |
          load   |  v   fill
 -------   +    ------   +    ------        (CBM)
|       | ---> |      | ---> |      | ---\ +  dump
| Empty |      | Busy |      | Full |    |   or
|       | <--- |      | <--- |      | <--/ +- pass
 -------    -   ------    -   ------
         unload         flush
</pre>

<p>We can see from the resulting state diagram that when the datapath is empty,
 it can only support an insertion, and when it is full, it can only support
 a removal, unles in Circular Buffer Mode (CBM) where it can support
 insertion and removal when full. <em>These constraints will become very
 important later on.</em> Normally, if the interfaces try to remove while Empty,
 or insert while Full, data will be duplicated or lost, respectively.</p>
<p>This simple FSM description helped us clarify the problem, but it also
 glossed over the potential complexity of the implementation: 3 states, each
 connected to 2 signals (valid/ready) per interface, for a total of 16
 possible transitions out of each state, or 48 possible state transitions
 total. The Circular Buffer Mode does not introduce a new state, as it is
 an elaboration-time parameter, not a run-time input.</p>
<p>We don't want to have to manually enumerate all the transitions to then
 coalesce the equivalent ones and rule out all the impossible or illegal
 ones. Instead, if we express in logic the constraints on removals and
 insertions we determined from the state diagram, and the possible
 transformations on the datapath, we then get the state transition logic and
 datapath control signal logic almost for free.</p>
<p>Lets describe the possible states of the datapath, and initialize it.  This
 code describes a binary state encoding, but the CAD tool can re-encode and
 re-number the state encoding. Usually this is beneficial, but if the
 states+inputs fit in a single LUT, forcing binary encoding reduces area.
 See what works best (i.e.: reaches the highest speed) for your given FPGA.</p>

<pre>
    localparam STATE_BITS = 2;

    localparam [STATE_BITS-1:0] EMPTY = 'd0; // Output and buffer registers empty
    localparam [STATE_BITS-1:0] BUSY  = 'd1; // Output register holds data
    localparam [STATE_BITS-1:0] FULL  = 'd2; // Both output and buffer registers hold data
    // There is no case where only the buffer register would hold data.

    // No handling of erroneous and unreachable state 3.
    // We could check and raise an error flag.

    wire [STATE_BITS-1:0] state;
    reg  [STATE_BITS-1:0] state_next = EMPTY;
</pre>

<p>Now, let's express the constraints we figured out from the state diagram:</p>
<ul>
<li>The input interface can only insert when the datapath is not full.</li>
<li>The output interface can only remove data when the datapath is not empty,
   except in Circular Buffer Mode, where it can also insert.</li>
</ul>
<p>We do this by computing the allowable output read/valid handshake signals
 based on the datapath state. We use <code>state_next</code> so we can have nice
 registered outputs. This little bit of code prunes away a large number of
 invalid state transitions. If some other logic seems to be missing, first
 see if this code has made it unnecessary.</p>
<p><em>This tiny bit of code is critical</em> since it also implies the fundamental
 operating assumptions of a skid buffer: that one interface cannot have its
 current state depend on the current state of the other interface, as that
 would be a combinational path between both interfaces.</p>
<p>Compute <code>ready</code> for the input interface. In Circular Buffer Mode, the input
 interface is always ready.</p>

<pre>
    <a href="./Register.html">Register</a>
    #(
        .WORD_WIDTH     (1),
        .RESET_VALUE    (1'b1) // EMPTY at start, so accept data
    )
    input_ready_reg
    (
        .clock          (clock),
        .clock_enable   (1'b1),
        .clear          (clear),
        .data_in        ((state_next != FULL) || (CIRCULAR_BUFFER != 0)),
        .data_out       (input_ready)
    );
</pre>

<p>Compute <code>valid</code> for the output interface</p>

<pre>
    <a href="./Register.html">Register</a>
    #(
        .WORD_WIDTH     (1),
        .RESET_VALUE    (1'b0)
    )
    output_valid_reg
    (
        .clock          (clock),
        .clock_enable   (1'b1),
        .clear          (clear),
        .data_in        (state_next != EMPTY),
        .data_out       (output_valid)
    );
</pre>

<p>After, let's describe the interface signal conditions which implement our
 two basic operations on the datapath: insert and remove. This also weeds
 out a number of possible state transitions.</p>

<pre>
    reg insert = 1'b0;
    reg remove = 1'b0;

    always @(*) begin
        insert = (input_valid  == 1'b1) && (input_ready  == 1'b1);
        remove = (output_valid == 1'b1) && (output_ready == 1'b1);
    end
</pre>

<p>Now that we have our datapath states and operations, let's use them to
 describe the possible transformations to the datapath, and in which state
 they can happen.  You'll see that these exactly describe each of the
 5 edges in the state diagram (7 in Circular Buffer Mode), and since we've
 pruned the space of possible interface conditions, we only need the minimum
 logic to describe them, and this logic gets re-used a lot later on,
 simplifying the code.</p>

<pre>
    reg load    = 1'b0; // Empty datapath inserts data into output register.
    reg flow    = 1'b0; // New inserted data into output register as the old data is removed.
    reg fill    = 1'b0; // New inserted data into buffer register. Data not removed from output register.
    reg flush   = 1'b0; // Move data from buffer register into output register. Remove old data. No new data inserted.
    reg unload  = 1'b0; // Remove data from output register, leaving the datapath empty.
    reg dump    = 1'b0; // New inserted data into buffer register. Move data from buffer register into output register. Discard old output data. (CBM)
    reg pass    = 1'b0; // New inserted data into buffer register. Move data from buffer register into output register. Remove old output data.  (CBM)

    always @(*) begin
        load    = (state == EMPTY) && (insert == 1'b1) && (remove == 1'b0);
        flow    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b1);
        fill    = (state == BUSY)  && (insert == 1'b1) && (remove == 1'b0);
        unload  = (state == BUSY)  && (insert == 1'b0) && (remove == 1'b1);
        flush   = (state == FULL)  && (insert == 1'b0) && (remove == 1'b1);
        dump    = (state == FULL)  && (insert == 1'b1) && (remove == 1'b0) && (CIRCULAR_BUFFER != 0);
        pass    = (state == FULL)  && (insert == 1'b1) && (remove == 1'b1) && (CIRCULAR_BUFFER != 0);
    end
</pre>

<p>And now we simply need to calculate the next state after each datapath
 transformations:</p>

<pre>
    always @(*) begin
        state_next = (load   == 1'b1) ? BUSY  : state;
        state_next = (flow   == 1'b1) ? BUSY  : state_next;
        state_next = (fill   == 1'b1) ? FULL  : state_next;
        state_next = (flush  == 1'b1) ? BUSY  : state_next;
        state_next = (unload == 1'b1) ? EMPTY : state_next;
        state_next = (dump   == 1'b1) ? FULL  : state_next;
        state_next = (pass   == 1'b1) ? FULL  : state_next;
    end

    <a href="./Register.html">Register</a>
    #(
        .WORD_WIDTH     (STATE_BITS),
        .RESET_VALUE    (EMPTY)         // Initial state
    )
    state_reg
    (
        .clock          (clock),
        .clock_enable   (1'b1),
        .clear          (clear),
        .data_in        (state_next),
        .data_out       (state)
    );
</pre>

<p>Similarly, from the datapath transformations, we can compute the necessary
 control signals to the datapath. These are not registered here, as they end
 at registers in the datapath.</p>

<pre>
    always @(*) begin
        data_out_wren     = (load  == 1'b1) || (flow == 1'b1) || (flush == 1'b1) || (dump == 1'b1) || (pass == 1'b1);
        data_buffer_wren  = (fill  == 1'b1)                                      || (dump == 1'b1) || (pass == 1'b1);
        use_buffered_data = (flush == 1'b1)                                      || (dump == 1'b1) || (pass == 1'b1);
    end

endmodule
</pre>

<p>For a 64-bit connection, the resulting skid buffer uses 128 registers for
 the buffers, 4 to 9 registers (and associated LUTs) for the FSM and
 interface outputs, depending on the particular state encoding chosen by the
 CAD tool, and easily reaches a high operating speed.</p>

<hr>
<p><a href="./index.html">Back to FPGA Design Elements</a>
<center><a href="https://fpgacpu.ca/">fpgacpu.ca</a></center>
</body>
</html>

